Abstract

We propose a non-parametric anomaly detection algorithm for high dimensional data. We score each datapoint by its average K-NN distance, and rank them accordingly. We then train limited complexity models to imitate these scores based on the max-margin learning-to-rank framework. A test point is declared an anomaly at false alarm level $\alpha$ if the predicted score is in the $\alpha$-percentile. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate $\alpha$, its decision region converges to the $\alpha$-percentile minimum volume level set of the unknown underlying density. In addition, we test both the statistical performance and computational efficiency of our algorithm on a number of synthetic and real-data experiments. Our results demonstrate the superiority of our algorithm over existing K-NN based anomaly detection algorithms, with significant computational savings.


1 Introduction 


Anomaly detection is the problem of identifying statistically significant deviations in data from expected normal behavior. It has found wide applications in many areas such as credit card fraud detection, intrusion detection for cyber security, sensor networks and video surveillance [Chandola et al., 2009, Hodge and Austin, 2004].


In classical parametric methods [Basseville et al., 1993] for anomaly detection, we assume the existence of a family of functions characterizing the nominal density (the test data consists of examples belonging to two classes: nominal and anomalous). Parameters are then estimated from training data by minimizing a loss function. While these methods provide a statistically justifiable solution when the assumptions hold true, they are likely to suffer from model mismatch, which leads to poor performance.

We focus on the non-parametric approach, with a view towards minimum volume (MV) set estimation. Given $\alpha \in (0,1)$, the MV approach attempts to find the set of minimum volume which has probability mass at least $1 - \alpha$ with respect to the unknown sample probability distribution. A new test point is then declared to be consistent with the data if it lies in this MV set.


Approaches to the MV set estimation problem include estimating density level sets [Nunez-Garcia et al., 2003, Cuevas and Rodriguez-Casal, 2003] or estimating the boundary of the MV set [Scott and Nowak, 2006, Park et al., 2010]. However, these approaches suffer from high sample complexity, and are therefore statistically unstable for high dimensional data. The authors of [Zhao and Saligrama, 2009] score each test point using the K-NN distance. These scores turn out to yield empirical estimates of the volume of the minimum volume level sets containing the test point, and avoid computing any high dimensional quantities. The papers [Hero, 2006, Sricharan and Hero, 2011] also take a K-NN based approach to MV set anomaly detection. The second paper [Sricharan and Hero, 2011] improves upon the computational performance of [Hero, 2006]. However, the test stage runtime of [Sricharan and Hero, 2011] is of order $O(dn)$, $d$ being the ambient dimension and $n$ the sample size. The test stage runtime of [Zhao and Saligrama, 2009] is of order $O(dn^2 + \log(n))$.


Computational inefficiencies of these K-NN based anomaly detection methods suggest that a different approach based on distance-based (DB) outlier methods (see [Orair et al., 2010] and references therein) could possibly be leveraged in this context. DB methods primarily focus on the computational issue of identifying a pre-specified number L of points (outliers) with the largest K-NN distances in a database. Outliers are identified by pruning examples with small K-NN distance. This works particularly well for small L.

In contrast, for anomaly detection, we not only need an efficient scheme but also one that takes training data (containing no anomalies) and generalizes well in terms of the AUC criterion on test data where the number of anomalies is unknown. We need schemes that predict “anomalousness” for test instances in order to adapt to any false alarm level and to characterize AUCs. One possible way to leverage DB methods is to estimate anomaly scores based only on the L identified outliers, but this scheme generally has poor AUC performance if there is a sizable fraction of anomalies. In this context [Liu et al., 2008, Ting et al., 2010, Sricharan and Hero, 2011] propose to utilize ORCA [Bay and Schwabacher, 2003]. ORCA is a well-known ranking DB method that provides intermediate estimates for every instance in addition to the L outliers. They show that for small L ORCA is highly efficient but its AUC performance is poor. For large L ORCA produces low but somewhat meaningful AUCs but can be computationally inefficient. A basic reason for this AUC gap is that although such rank-based DB techniques provide intermediate K-NN estimates and outlier scores that can possibly be leveraged, these estimates and scores are often too unreliable for anomaly detection purposes. Recently, [Wang et al., 2011] have considered strategies based on LSH to further speed up rank-based DB methods. Our perspective is that this direction is somewhat complementary. Indeed, we could also employ Kernel-LSH [Kulis and Grauman, 2009] in our setting to further speed up our computation.


point, $\eta \in \mathbb{R}^d$, is given, and test whether $\eta$ follows the distribution of $X$. If $f$ denotes the density of this new (random) data point, then the set-up is summarized in the following hypothesis test:

$$H_0 : f = f_0 \quad \text{vs.} \quad H_1 : f \neq f_0.$$

We look for a functional $D : \mathbb{R}^d \to \mathbb{R}$ such that $D(\eta) > 0 \implies \eta$ nominal. Given such a $D$, we define its corresponding acceptance region by $A = \{x : D(x) > 0\}$. We will see below that $D$ can be defined by the p-value.

Given a prescribed significance level (false alarm level) $\alpha \in (0, 1)$, we require the probability that $\eta$ does not deviate from the nominal ($\eta \in A$), given $H_0$, to be bounded below by $1 - \alpha$. We denote this distribution by $P$ (sometimes written $P(\text{not } H_1 \mid H_0)$):

$$P(A) = \int_A f_0(x)\,dx \geq 1 - \alpha.$$

Said another way, the probability that $\eta$ does deviate from the nominal, given $H_0$, should fall under the specified significance level $\alpha$ (i.e. $1 - P(A) = P(H_1 \mid H_0) \leq \alpha$). At the same time, the false negative, $\int_A f(x)\,dx$, must be minimized. Note that the false negative is the probability of the event $\eta \in A$, given $H_1$. We assume $f$ to be bounded above by a constant $C$, in which case $\int_A f(x)\,dx \leq C \cdot \lambda(A)$, where $\lambda$ is Lebesgue measure on $\mathbb{R}^d$. The problem of finding the most suitable acceptance region, $A$, can therefore be formulated as finding the following minimum volume set:


In this paper, we propose a ranking based algorithm which retains the statistical complexity of existing K-NN work, but with far superior computational performance. Using scores based on average K-NN distance, we learn a functional predictor through the pair-wise learning-to-rank framework, to predict p-value scores. This predictor is then used to generalize over unseen examples. The test time of our algorithm is of order $O(ds)$, where $s$ is the complexity of our model.

The rest of the paper is organized as follows. In Section 2 we introduce the problem setting and the motivation. Detailed algorithms are described in Sections 3 and 4. The asymptotic and finite-sample analyses are provided in Section 5. Synthetic and real experiments are reported in Section 6.

2 Problem Setting & Motivation 


$$U_{1-\alpha} := \arg\min_{A} \left\{ \lambda(A) : \int_A f_0(x)\,dx \geq 1 - \alpha \right\}. \qquad (1)$$


In words, we seek a set $A$ of minimum volume which captures at least a fraction $1 - \alpha$ of the probability mass.

3 Score Functions Based on K-NNG


In this section, we briefly review an algorithm using score functions based on nearest neighbor graphs for determining minimum volume sets [Zhao and Saligrama, 2009, Qian and Saligrama, 2012]. Given a test point $\eta \in \mathbb{R}^d$, define the p-value of $\eta$ by

$$p(\eta) := P(x : f_0(x) \leq f_0(\eta)) = \int_{\{x : f_0(x) \leq f_0(\eta)\}} f_0(x)\,dx.$$


Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given set of nominal $d$-dimensional data points. We assume $X$ to be sampled i.i.d. from an unknown density $f_0$ with compact support in $\mathbb{R}^d$. The problem is to assume a new data


Then, assuming technical conditions on the density $f_0$ [Zhao and Saligrama, 2009], it can be shown that $p$ defines the minimum volume set:

$$U_{1-\alpha} = \{x : p(x) \geq \alpha\}.$$

Thus if we know $p$, we know the minimum volume set, and we can declare anomaly simply by checking whether or not $p(\eta) < \alpha$. However, $p$ is based on information from the unknown density $f_0$, hence we must estimate $p$.
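As a concrete illustration of the p-value and the resulting decision rule, the sketch below assumes, purely for the sake of the example, that the nominal density $f_0$ is the one-dimensional standard normal. In that special case the p-value has a closed form, because $f_0(x) \leq f_0(\eta)$ exactly when $|x| \geq |\eta|$. The function names are hypothetical and not taken from the paper.

```python
import math

def p_value_std_normal(eta: float) -> float:
    """p(eta) = P(f0(x) <= f0(eta)) for the assumed density f0 = N(0, 1).

    f0 decreases in |x|, so the event is |x| >= |eta|, which gives
    p(eta) = 2 * (1 - Phi(|eta|)) = erfc(|eta| / sqrt(2)).
    """
    return math.erfc(abs(eta) / math.sqrt(2.0))

def is_anomaly(eta: float, alpha: float) -> bool:
    """Declare eta anomalous when its p-value falls below alpha."""
    return p_value_std_normal(eta) < alpha

alpha = 0.05
for eta in (0.0, 1.0, 3.0):
    print(eta, round(p_value_std_normal(eta), 4), is_anomaly(eta, alpha))
```

Under this assumed density the minimum volume set at level $1 - \alpha$ is a symmetric interval around the origin, which is the set of points whose p-value is at least $\alpha$.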

Set $d(x, y)$ to be the Euclidean metric on $\mathbb{R}^d$. Given a point $x \in \mathbb{R}^d$, we form its associated K nearest neighbor graph (K-NNG), relative to $X$, by connecting it to the $K$ closest points in $X \setminus \{x\}$. Let $D_{(i)}(x)$ denote the distance from $x$ to its $i$th nearest neighbor in $X \setminus \{x\}$. Set

$$G_X(x) = \frac{1}{K} \sum_{i=1}^{K} D_{(i)}(x). \qquad (2)$$

Now define the following score function:


4 Anomaly Detection Algorithm 

In this section we describe our rank-based anomaly 
detection algorithm (RankAD), and discuss several of 
its properties and advantages. 


Algorithm 1: RankAD Algorithm 


1. Input: Nominal training data $X = \{x_1, x_2, \ldots, x_n\}$, desired false alarm level $\alpha$, and test point $\eta$.

2. Training Stage:

(a) Calculate Kth nearest neighbor distances $G_X(x_i)$, and calculate $R_n(x_i)$ for each nominal sample $x_i$, using Eq.(2) and Eq.(3).




$$R_n(\eta) := \frac{1}{n} \sum_{i=1}^{n} 1_{\{G_X(\eta) \leq G_X(x_i)\}}. \qquad (3)$$


This function measures the relative concentration of point $\eta$ compared to the training set. In [Qian and Saligrama, 2012], given a pre-defined significance level $\alpha$ (e.g. 0.05), they declare $\eta$ to be anomalous if $R_n(\eta) \leq \alpha$. This choice is motivated by its close connection to multivariate p-values. Indeed, it is shown in [Qian and Saligrama, 2012] that this score function is an asymptotically consistent estimator of the p-value:

$$\lim_{n \to \infty} R_n(\eta) = p(\eta) \quad \text{a.s.}$$
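A minimal NumPy sketch of the average K-NN distance statistic and the empirical rank built on it is given below. It is a brute-force illustration only: the data are synthetic, the function names are hypothetical, and the self-distance of a training point is dropped so that each training point is compared against the remaining sample.

```python
import numpy as np

def avg_knn_distance(points, reference, k, exclude_self=False):
    """Average distance from each row of `points` to its k nearest rows of `reference`."""
    # Pairwise Euclidean distances, shape (len(points), len(reference)).
    diff = points[:, None, :] - reference[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist.sort(axis=1)
    # When the points belong to the reference set, skip the zero self-distance.
    start = 1 if exclude_self else 0
    return dist[:, start:start + k].mean(axis=1)

def empirical_rank(g_eta, g_train):
    """R_n(eta) = (1/n) * #{i : G(eta) <= G(x_i)} for each entry of g_eta."""
    return (g_eta[:, None] <= g_train[None, :]).mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))            # synthetic nominal sample
test = np.array([[0.0, 0.0], [6.0, 6.0]])    # a central point and a far-away point

k = 20
g_train = avg_knn_distance(train, train, k, exclude_self=True)
g_test = avg_knn_distance(test, train, k)
r_test = empirical_rank(g_test, g_train)
print(r_test)  # the far-away point receives a rank near 0, the central one a rank near 1
```

Each test point costs a pass over all $n$ training points here, which is the per-point cost that motivates learning a cheaper ranker.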


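(b) Quantize $\{R_n(x_i), i = 1, 2, \ldots, n\}$ uniformly into $m$ levels: $r_q(x_i) \in \{1, 2, \ldots, m\}$. Generate preference pairs $(i, j)$ whenever their quantized levels are different: $r_q(x_i) > r_q(x_j)$.

(c) Set $\mathcal{P} = \{(i, j) : r_q(x_i) > r_q(x_j)\}$. Solve:

$$\min_{g, \xi_{ij}} : \frac{1}{2}\|g\|^2 + C \sum_{(i,j) \in \mathcal{P}} \xi_{ij} \qquad (5)$$
$$\text{s.t.} \quad \langle g, \Phi(x_i) - \Phi(x_j) \rangle \geq 1 - \xi_{ij}, \quad \forall (i, j) \in \mathcal{P}$$
$$\xi_{ij} \geq 0$$

(d) Let $\hat{g}$ denote the minimizer. Compute and sort: $\hat{g}(\cdot) = \langle \hat{g}, \Phi(\cdot) \rangle$ on $X = \{x_1, x_2, \ldots, x_n\}$.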
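The quantization and pair-generation steps (b) and (c) of the training stage can be sketched as follows. This is an illustration under assumptions: equal-width bins on $[0, 1]$ are one reading of "uniform" quantization, the rank values are made up, and the function names are hypothetical. Solving the rank-SVM problem itself is not shown.

```python
import numpy as np

def quantize_uniform(ranks, m):
    """Map ranks in [0, 1] to integer levels 1..m using m equal-width bins (assumed reading)."""
    levels = np.floor(ranks * m).astype(int) + 1
    return np.clip(levels, 1, m)   # a rank of exactly 1.0 is assigned to level m

def preference_pairs(levels):
    """All (i, j) with levels[i] > levels[j], as an integer array of shape (num_pairs, 2)."""
    i, j = np.nonzero(levels[:, None] > levels[None, :])
    return np.stack([i, j], axis=1)

# Made-up, evenly spread rank values standing in for R_n(x_1), ..., R_n(x_n).
n, m = 12, 3
ranks = (np.arange(n) + 0.5) / n
levels = quantize_uniform(ranks, m)
pairs = preference_pairs(levels)

print(levels)       # [1 1 1 1 2 2 2 2 3 3 3 3]
print(len(pairs))   # 48
```

With equally populated levels the number of pairs is $n^2(1 - 1/m)/2$, here 48, compared with $\binom{12}{2} = 66$ pairs when every point has its own level.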


This result is attractive from a statistical viewpoint, however the test-time complexity of the K-NN distance statistic grows as $O(dn)$. This can be prohibitive for real-time applications. Thus we are compelled to learn a score function respecting the K-NN distance statistic, but with significant computational savings. This is achieved by mapping the data set $X$ into a reproducing kernel Hilbert space (RKHS), $H$, with kernel $k$ and inner product $\langle \cdot, \cdot \rangle$. We denote by $\Phi$ the mapping $\mathbb{R}^d \to H$, defined by $\Phi(x_i) = k(x_i, \cdot)$. We then optimally learn a ranker $g \in H$ based on the ordered pair-wise ranking information,

$$\{(i, j) : G_X(x_i) > G_X(x_j)\}$$

and construct the scoring function as

3. Testing Stage: 

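(a) Evaluate $\hat{g}(\eta)$ for test point $\eta$.

(b) Compute the score: $\hat{R}_n(\eta) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{\hat{g}(\eta) \leq \hat{g}(x_i)\}}$. This can be done through a binary search over sorted $\{\hat{g}(x_i), i = 1, \ldots, n\}$.

(c) Declare $\eta$ as anomalous if $\hat{R}_n(\eta) \leq \alpha$.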
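The testing stage can be sketched as below, assuming the trained ranker is available as a kernel expansion over a set of support points. The support points, the coefficients `beta`, and the kernel normalization are all placeholders chosen for illustration (they are not the output of the rank-SVM step); the point of the sketch is the binary search over the pre-sorted training values.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Assumed RBF form exp(-||x - x'||^2 / sigma^2) between rows of a and rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma ** 2)

def ranker_values(x, support, beta, sigma):
    """g(x) = sum_s beta_s * k(x_s, x); one kernel evaluation per support point."""
    return rbf_kernel(x, support, sigma) @ beta

def predicted_rank(g_eta, g_train_sorted):
    """(1/n) * #{i : g(eta) <= g(x_i)} via binary search on the sorted training values."""
    n = len(g_train_sorted)
    below = np.searchsorted(g_train_sorted, g_eta, side="left")  # #{i : g(x_i) < g(eta)}
    return (n - below) / n

rng = np.random.default_rng(0)
train = rng.normal(size=(400, 2))
support = train[:50]                 # hypothetical support points
beta = -np.ones(len(support))        # hypothetical coefficients: g increases away from the data
sigma = 1.5

g_sorted = np.sort(ranker_values(train, support, beta, sigma))  # computed once after training
eta = np.array([[0.0, 0.0], [6.0, 6.0]])
scores = predicted_rank(ranker_values(eta, support, beta, sigma), g_sorted)
alpha = 0.05
print(scores, scores <= alpha)       # only the far-away point is flagged
```

The sorted array is built once, so each test point needs only the kernel evaluations against the support points plus an $O(\log n)$ search.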


Remark 1: The standard learning-to-rank setup [Joachims, 2002] is to assume non-noisy input pairs. Our algorithm is based on noisy inputs, where the noise is characterized by an unknown, high-dimensional distribution. Yet we are still able to show the asymptotic consistency of the obtained ranker in Sec. 5.


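$$\hat{R}_n(\eta) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{\langle g, \Phi(\eta) \rangle \leq \langle g, \Phi(x_i) \rangle\}}. \qquad (4)$$

It turns out that $\hat{R}_n$ is an asymptotic estimator of the p-value (see Section 5) and thus we will say a test point $\eta$ is anomalous if $\hat{R}_n(\eta) \leq \alpha$.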


Remark 2: For the learning-to-rank step Eq.(5), we equip the RKHS $H$ with the RBF kernel $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma^2}\right)$. The algorithm parameter $C$ and RBF kernel bandwidth $\sigma$ can be selected through cross validation, since this step is a supervised learning procedure based on input pairs. We use cross validation and adopt the weighted pairwise disagreement loss (WPDL) from [Lan et al., 2012] for this purpose.


Remark 3: The number of quantization levels, $m$, impacts training complexity as well as performance. When $m = n$, all $\binom{n}{2}$ preference pairs are generated. This scenario has the highest training complexity. Furthermore, large $m$ tends to more closely follow rankings obtained from K-NN distances, which may or may not be desirable, since K-NN distances can be noisy for small training data sizes. While this raises the question of choosing $m$, we observe that setting $m$ between 3 and 5 works fairly well in practice. We fix $m = 3$ in all of our experiments in Sec. 6. Setting $m = 2$ is insufficient to allow flexible false alarm control, as will be demonstrated next.


Remark 4: Let us mention the connection with 
ranking SVM. Ranking SVM is an algorithm for the 
learning-to-rank problem, whose goal is to rank un¬ 
seen objects based on given training data and their 
corresponding orderings. Our novelty lies in building 
a connection between learning-to-rank and anomaly 
detection: 

(1) While there is no such natural “input ordering” in anomaly detection, we create this order on training samples through their K-NN scores.

(2) When we apply our detector on an unseen object it produces a score that approximates the unseen object’s p-value. We theoretically justify this linkage, namely, our predictions fall in the right quantile (Theorem 3). We also empirically show test-stage computational benefits.


4.1 False alarm control 

In this section we illustrate through a toy example how our learning method approximates minimum volume sets. We consider how different levels of quantization impact level sets. We will show that for appropriately chosen quantization levels our algorithm is able to simultaneously approximate multiple level sets. In Section 5 we show that the normalized score Eq.(4) takes values in $[0, 1]$ and converges to the p-value function. Therefore we get a handle on the false alarm rate, so the null hypothesis can be rejected at different levels simply by thresholding $\hat{R}_n(\eta)$.

Toy Example: 

We present a simple example in Fig. 1 to demonstrate this point. The nominal density is $f \sim 0.5\,\mathcal{N}([4; 1], 0.5I) + 0.5\,\mathcal{N}([4; -1], 0.5I)$. We first consider single-bit quantization ($m = 2$) using RBF kernels ($\sigma = 1.5$) trained with pairwise preferences between p-values above and below 3%. This yields a decision function $g_2(\cdot)$. The standard way is to claim anomaly when $g_2(x) < 0$, corresponding to the outermost orange curve in (a). We then plot different level curves by varying $c > 0$ for $g_2(x) = c$, which appear to be scaled versions of the orange curve. While this quantization appears to work reasonably for $\alpha$-level sets with $\alpha = 3\%$, for a different desired $\alpha$-level the algorithm would have to retrain with new preference pairs. On the other hand, we also train rankAD with $m = 3$ (uniform quantization) and obtain the ranker $g_3(\cdot)$. We then vary $c$ for $g_3(x) = c$ to obtain various level curves shown in (b), all of which surprisingly approximate the corresponding density level sets well. We notice a significant difference between the level sets generated with 3 quantization levels in comparison to those generated for two-level quantization. In the appendix we show that $g(x)$ asymptotically preserves the ordering of the density, and from this conclude that our score function $\hat{R}_n(x)$ approximates multiple density level sets (p-values). Also see Section 5 for a discussion of this. However in our experiments it turns out that we just need $m = 3$ quantization levels instead of $m = n$ ($\binom{n}{2}$ pairs) to achieve flexible false alarm control and do not need any re-training.


4.2 Time Complexity 


For training, the rank computation step requires computing all pair-wise distances among nominal points, $O(dn^2)$, followed by sorting for each point, $O(n^2 \log n)$. So the training stage has the total time complexity $O(n^2(d + \log n) + T)$, where $T$ denotes the time of the pair-wise learning-to-rank algorithm. At test stage, our algorithm only evaluates $\hat{g}(\eta)$ on $\eta$ and does a binary search among $\hat{g}(x_1), \ldots, \hat{g}(x_n)$. The complexity is $O(ds + \log n)$, where $s$ is the number of support vectors. This has some similarities with one-class SVM, where the complexity scales with the number of support vectors [Schölkopf et al., 2001]. Note that in contrast nearest neighbor-based algorithms, K-LPE, aK-LPE or BP-K-NNG [Zhao and Saligrama, 2009, Qian and Saligrama, 2012, Sricharan and Hero, 2011], require $O(nd)$ for testing one point. It is worth noting that $s \leq n$ comes from the “support pairs” within the input preference pair set. In practice we observe that for most data sets in the experiment section $s$ is much smaller than $n$, leading to significantly reduced test time compared to aK-LPE, as shown in Table 3. It is worth mentioning that distributed techniques for speeding up computation of K-NN distances [Bhaduri et al., 2011] can be adopted to further reduce test stage time.


5 Analysis 

In this section we present the theoretical analysis of our ranking-based anomaly detection approach.

(a) Level curves (m = 2) (b) Level curves (m = 3)

Figure 1: Level curves of rankAD for different quantization levels. 1000 i.i.d. samples are drawn from a 2-component Gaussian mixture density. Left figure (a) depicts performance with single-bit quantization ($m = 2$). To learn rankAD we quantized preference pairs at 3% and used $\sigma = 1.5$ in our RBF kernel. Right figure (b) shows rankAD with 3 levels of quantization and $\sigma = 1.5$. (a) shows level curves obtained by varying the offset $c$ for $g_2(x) = c$. Only the outermost curve ($c = 0$) approximates the oracle density level set well while the inner curves ($c > 0$) appear to be scaled versions of the outermost curve. (b) shows level curves obtained by varying $c$ for $g_3(x) = c$. Interestingly we observe that the innermost curve approximates peaks of the mixture density.


5.1 Asymptotic Consistency 

As mentioned earlier in the paper, it is shown in [Qian and Saligrama, 2012] that the average K-NN distance statistic converges to the p-value function:

Theorem 1. With K = [...], we have $\lim_{n \to \infty} R_n(\eta) = p(\eta)$.


The goal of our rankAD algorithm is to learn the ordering of the p-value. This theorem therefore guarantees that asymptotically, the preference pairs generated as input to the rankAD algorithm are reliable. Note that the definition of $G$ in [Qian and Saligrama, 2012] is slightly different than the one given in equation (2). However, for our purposes this difference is not worth detailing.


What we claim in this paper, and prove in the appendix, is the following consistency result of our rankAD algorithm. Note that the use of quantization (cf. Section 4) does not affect the conclusion of this theorem, hence we assume there is none. Indeed, quantization is a computational tool; from the perspective of statistical asymptotic consistency it is not an issue.

Theorem 2. With K = [...], as $n \to \infty$, $\hat{R}_n(\eta) \to p(\eta)$.


The difficulty in this theorem arises from the fact that the score, $\hat{R}_n(\eta)$, is based on the ranker, $\hat{g}$, which is learned from data with high-dimensional noise. Moreover, the noise is distributed according to an unknown probability measure. For the proof of this theorem, we begin with the law of large numbers. Suppose for any $n \geq 1$, a function $G$ is found such that $f(x_i) < f(x_j) \implies G(x_i) < G(x_j)$. Note that in Section 3 we use K-NN distance surrogates, which reverse the order, but the effect is the same and should not cause any confusion. Then it can be shown that

$$\frac{1}{n} \sum_{i=1}^{n} 1_{\{G(x_i) \leq G(\eta)\}} \to p(\eta).$$

Thus we wish to prove that the output of our rankAD algorithm is such a function.


The first step in our proof is to show that the solution to our rankAD algorithm, $g$, is consistent [Steinwart, 2001]. Fix an RKHS $H$ on the input space $\mathcal{X} \subset \mathbb{R}^d$ with RBF kernel $k$. We denote by $L$ the hinge loss. We may write $g$ as the solution to the following regularized minimization problem:

$$g = \arg\min_{f \in H} \mathcal{R}_{L,T}(f) + \lambda_n \|f\|_H^2,$$

where $\mathcal{R}_{L,T}(f)$ is the average of the loss $L$ over $T$, and $T$ denotes the pairs from the sample $X = \{x_1, \ldots, x_n\}$, so this is a loss with respect to the empirical measure. The expected risk is denoted

$$\mathcal{R}_{L,P}(f) = E_x[\mathcal{R}_{L,T}(f)].$$

Then consistency means that, under appropriate conditions as $\lambda_n \to 0$ and $n \to \infty$ (see appendix), we have

$$E_x[\mathcal{R}_{L,T}(g)] \to \min_{f \in H} \mathcal{R}_{L,P}(f). \qquad (6)$$

The proof of this claim requires a concentration of measure result relating $\mathcal{R}_{L,T}(f)$ to its expectation, $\mathcal{R}_{L,P}(f)$, uniformly over $f \in H$. The argument follows closely that made in [Cucker and Smale, 2001], except we make use of McDiarmid’s inequality.

Finally we show that if $g$ satisfies (6), then it ranks samples according to their density: $f(x_i) > f(x_j) \implies g(x_i) > g(x_j)$.

5.2 Finite-Sample Generalization Result 

Based on a sample $\{x_1, \ldots, x_n\}$, our approach learns a ranker $g_n$, and computes the values $g_n(x_1), \ldots, g_n(x_n)$. Let $g_n^{(1)} \leq g_n^{(2)} \leq \cdots \leq g_n^{(n)}$ be the ordered permutation of these values. For a test point $\eta$, we evaluate $g_n(\eta)$ and compute $\hat{R}_n(\eta)$. For a prescribed false alarm level $\alpha$, we define the decision region for claiming anomaly by

$$R_\alpha = \{x : \hat{R}_n(x) \leq \alpha\}$$
$$= \Big\{x : \sum_{j=1}^{n} 1_{\{g_n(x) \leq g_n(x_j)\}} \leq \alpha n\Big\}$$
$$= \{x : g_n(x) < g_n^{(\lceil \alpha n \rceil)}\}$$

Remarks

(1) To interpret the theorem notice that the LHS is precisely the probability that a test point drawn from the nominal distribution has a score below the $\alpha \approx \frac{m-1}{n}$ percentile. We see that this probability is bounded from above by $\alpha$ plus an error term that asymptotically approaches zero. This theorem is true irrespective of $\alpha$ and so we have shown that we can simultaneously approximate multiple level sets.

(2) A similar inequality holds for the event giving a lower bound on $g(x)$. However, let us emphasize that lower bounds are not meaningful for our context. The ranks $g^{(1)} \leq g^{(2)} \leq \cdots \leq g^{(n)}$ are sorted in increasing order. A smaller $g(x)$ signifies that $x$ is more of an outlier. Points below the lowest rank $g^{(1)}$ correspond to the most extreme outliers.
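The false alarm interpretation in these remarks can be checked numerically with a small simulation. The sketch below is an illustration under assumptions, not the paper's experiment: a fixed stand-in function plays the role of the learned ranker (smaller values for more outlying points, as in Remark (2)), the nominal data are synthetic Gaussians, and the threshold is the order statistic of the training values at position $\lceil \alpha n \rceil$.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    """Stand-in ranker (assumed): smaller values for points farther from the origin."""
    return -(x ** 2).sum(axis=1)

n, alpha = 2000, 0.05
train = rng.normal(size=(n, 2))        # nominal training sample
fresh = rng.normal(size=(20000, 2))    # fresh nominal points, used only to estimate the rate

g_sorted = np.sort(g(train))           # g^(1) <= ... <= g^(n)
m = math.ceil(alpha * n)               # 1-based index of the order statistic
threshold = g_sorted[m - 1]

# Fraction of fresh nominal points that fall in the anomaly region {x : g(x) < threshold}.
false_alarm_rate = float(np.mean(g(fresh) < threshold))
print(false_alarm_rate)                # close to alpha, up to sampling error
```

Because the same sorted array serves every $\alpha$, changing the false alarm level only changes which order statistic is used as the threshold.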


6 Experiments 


In this section, we carry out point-wise anomaly detection experiments on synthetic and real-world data sets. We compare our ranking-based approach against density-based methods BP-K-NNG [Sricharan and Hero, 2011] and aK-LPE [Qian and Saligrama, 2012], and two other state-of-art methods based on random sub-sampling, isolation forest [Liu et al., 2008] (iForest) and massAD [Ting et al., 2010]. One-class SVM [Schölkopf et al., 2001] is included as a baseline.


where $\lceil \alpha n \rceil$ denotes the ceiling integer of $\alpha n$.

We give a finite-sample bound on the probability that a newly drawn nominal point $\eta$ lies in $R_\alpha$. In the following Theorem, $\mathcal{F}$ denotes a real-valued function class of kernel based linear functions equipped with the $\ell_\infty$ norm over a finite sample $X = \{x_1, \ldots, x_n\}$:

$$\|f\|_{\ell_\infty^X} = \max_{x \in X} |f(x)|.$$

Note that $\mathcal{F}$ contains solutions to an SVM-type problem, so we assume the output of our rankAD algorithm, $g_n$, is an element of $\mathcal{F}$. We let $\mathcal{N}(\gamma, \mathcal{F}, n)$ denote the covering number of $\mathcal{F}$ with respect to this norm (see appendix for details).

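Theorem 3. Fix a distribution $P$ on $\mathbb{R}^d$ and suppose $x_1, \ldots, x_n$ are generated i.i.d. from $P$. For $g \in \mathcal{F}$ let $g^{(1)} \leq g^{(2)} \leq \cdots \leq g^{(n)}$ be the ordered permutation of $g(x_1), \ldots, g(x_n)$. Then for such an $n$-sample, with probability $1 - \delta$, for any $g \in \mathcal{F}$, $1 \leq m \leq n$ and sufficiently small $\gamma > 0$,

$$P\left\{x : g(x) < g^{(m)} - 2\gamma\right\} \leq \frac{m-1}{n} + \epsilon(n, k, \delta),$$

where $\epsilon(n, k, \delta) = \frac{2}{n}\left(k + \log\frac{n}{\delta}\right)$, $k = \lceil \log \mathcal{N}(\gamma, \mathcal{F}, 2n) \rceil$.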


6.1 Implementation Details 


In our simulations, the Euclidean distance is used as the distance metric for all candidate methods. For one-class SVM the LIBSVM codes [Chang and Lin, 2011] are used. The algorithm parameter and the RBF kernel parameter for one-class SVM are set using the same configuration as in [Ting et al., 2010]. For iForest and massAD, we use the codes from the websites of the authors, with the same configuration as [Ting et al., 2010]. For aK-LPE we use the average k-NN distance Eq.(2) with fixed k = 20 since this appears to work better than the actual K-NN distance of [Zhao and Saligrama, 2009]. Note that this is also suggested by the convergence analysis in Thm 1 [Qian and Saligrama, 2012]. For BP-K-NNG, the same k is used and other parameters are set according to [Sricharan and Hero, 2011].


For our rankAD approach we follow the steps described in Algorithm 1. We first calculate the ranks $R_n(x_i)$ of nominal points according to Eq.(3) based on aK-LPE. We then quantize $R_n(x_i)$ uniformly into $m = 3$ levels $r_q(x_i) \in \{1, 2, 3\}$ and generate pairs $(i, j) \in \mathcal{P}$ whenever $r_q(x_i) > r_q(x_j)$. We adapt the routine from [Chapelle and Keerthi, 2010] and extend it to a kernelized version for the learning-to-rank step Eq.(5). The trained ranker is then adopted in Eq.(4) for test stage prediction. We point out some implementation details of our approach as follows.


(1) Resampling: We follow [Qian and Saligrama, 2012] and adopt the U-statistic based resampling to compute aK-LPE ranks. We randomly split the data into two equal parts and use one part as “nearest neighbors” to calculate the ranks (Eq.(2), (3)) for the other part and vice versa. Final ranks are averaged over 20 rounds of resampling.


(2) Quantization levels & K-NN: For real experiments with 2000 nominal training points, we fix $k = 20$ and $m = 3$. These values are based on noting that the detection performance does not degrade significantly with smaller quantization levels for synthetic data. The $k$ parameter in K-NN is chosen to be 20 and is based on Theorem 1 and results from synthetic experiments (see below).


(3) Cross Validation using pairwise disagreement loss: For the rank-SVM step we use 4-fold cross validation to choose the parameters $C$ and $\sigma$. We vary $C \in \{0.001, 0.003, 0.01, \ldots, 300, 1000\}$, and the RBF kernel parameter $\sigma \in \Sigma = \{2^i D_K, i = -10, -9, \ldots, 9, 10\}$, where $D_K$ is the average 20-NN distance over nominal samples. The pair-wise disagreement indicator loss is adopted from [Lan et al., 2012] for evaluating rankers on the input pairs:

$$L(f) = \sum_{(i,j) \in \mathcal{P}} 1_{\{f(x_i) < f(x_j)\}}$$

Reported AUC performances are averaged over 5 runs. 
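The pairwise disagreement indicator loss and the bandwidth grid used during cross validation can be sketched in a few lines. The function names and the tiny example values are hypothetical; the grid for $C$ is not reproduced because its intermediate values are elided in the text.

```python
import numpy as np

def pairwise_disagreement(f_values, pairs):
    """Number of preference pairs (i, j) ordered the wrong way by the ranker, f(x_i) < f(x_j)."""
    i, j = pairs[:, 0], pairs[:, 1]
    return int(np.sum(f_values[i] < f_values[j]))

def sigma_grid(d_k):
    """Bandwidth candidates 2^i * D_K for i = -10, ..., 10."""
    return [2.0 ** i * d_k for i in range(-10, 11)]

# Item 0 is preferred to items 1 and 2, and item 1 is preferred to item 2.
pairs = np.array([[0, 1], [0, 2], [1, 2]])
good = np.array([3.0, 2.0, 1.0])   # respects every preference
bad = np.array([1.0, 3.0, 2.0])    # violates the two preferences involving item 0

print(pairwise_disagreement(good, pairs))  # 0
print(pairwise_disagreement(bad, pairs))   # 2
print(len(sigma_grid(0.5)))                # 21
```

In a cross-validation loop this loss would be evaluated on the held-out pairs for each candidate parameter setting, and the setting with the smallest loss retained.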


6.2 Synthetic Data sets 

We first apply our method to a Gaussian toy problem, where the nominal density is:

$$f_0 \sim 0.2\,\mathcal{N}([5; 0], [1, 0; 0, 9]) + 0.8\,\mathcal{N}([-5; 0], [9, 0; 0, 1]).$$

Anomalies follow the uniform distribution within $\{(x, y) : -18 \leq x \leq 18, -18 \leq y \leq 18\}$. The goal here is to understand the impact of different parameters (k-NN parameter and quantization level) used by RankAD. Fig. 2 shows the level curves for the estimated ranks on the test data. As indicated by the asymptotic consistency (Thm. 2) and the finite sample analysis (Thm. 3), the empirical level curves of rankAD approximate the level sets of the underlying density quite well. We vary $k$ and $m$ and evaluate the AUC performances of our approach shown in Table 1. The Bayesian AUC is obtained by thresholding the likelihood ratio using the generative densities. From Table 1 we see the detection performance is quite insensitive to the k-NN parameter and the quantization level parameter $m$, and for this simple synthetic example is close to Bayesian performance.

Figure 2: Level sets for the estimated ranks. 600 training points are used for training.
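The idea behind the Bayesian AUC can be sketched as follows. Since the anomalies are uniform on the box, the likelihood ratio is proportional to the nominal density $f_0$, so ranking test points by $f_0$ is equivalent to thresholding the likelihood ratio. The sketch makes one assumption that the text does not settle: the matrix entries are read as variances on the diagonal of the covariance. For that reason, and because of sampling noise, the printed value need not match the figure reported in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.2, 0.8])
means = np.array([[5.0, 0.0], [-5.0, 0.0]])
variances = np.array([[1.0, 9.0], [9.0, 1.0]])   # diagonal entries, read as variances (assumption)

def f0(x):
    """Density of the two-component Gaussian mixture with diagonal covariances."""
    total = np.zeros(len(x))
    for w, mu, var in zip(weights, means, variances):
        quad = ((x - mu) ** 2 / var).sum(axis=1)
        total += w * np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(var.prod()))
    return total

def sample_nominal(n):
    """Draw n points from the mixture by first picking a component for each point."""
    comp = rng.choice(2, size=n, p=weights)
    return means[comp] + rng.normal(size=(n, 2)) * np.sqrt(variances[comp])

def auc(nominal_scores, anomaly_scores):
    """P(nominal score > anomaly score), counting ties as 1/2."""
    diff = nominal_scores[:, None] - anomaly_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

nominal = sample_nominal(500)                            # test set sizes as in Table 1
anomalies = rng.uniform(-18.0, 18.0, size=(1000, 2))
bayes_auc = auc(f0(nominal), f0(anomalies))
print(bayes_auc)
```

Any detector's scores can be passed to the same `auc` helper, which makes the oracle value a convenient upper reference for the learned rankers.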

Table 1: AUC performances of Bayesian detector, aK-LPE, and rankAD with different values of k and m. 600 training points are used for training. For test 500 nominal and 1000 anomalous points are used.

AUC        k=?      k=?      k=20     k=?
m=3        0.9206   0.9200   0.9223   0.9210
m=5        0.9234   0.9243   0.9247   0.9255
m=7        0.9226   0.9228   0.9234   0.9213
m=10       0.9201   0.9208   0.9244   0.9196
aK-LPE     0.9192   0.9251   0.9244   0.9228
Bayesian   0.9290   0.9290   0.9290   0.9290

Table 2: Data characteristics of the data sets used in experiments. N is the total number of instances, d the dimension of data. The percentage in brackets indicates the percentage of anomalies among total instances.

data sets      N        d    anomaly class
Annthyroid     6832     6    classes 1,2
Forest Cover   286048   10   class 4 vs. class 2
HTTP           567497   3    attack
Mammography    11183    6    class 1
Mulcross       262144   4    2 clusters
Satellite      6435     36   3 smallest classes
Shuttle        49097    9    classes 2,3,5,6,7
SMTP           95156    3    attack

6.3 Real-world data sets 


We conduct experiments on several real data sets used in [Liu et al., 2008] and [Ting et al., 2010], including 2 network intrusion data sets HTTP and SMTP from [Yamanishi et al., 2000], Annthyroid, Forest Cover Type, Satellite, Shuttle from the UCI repository [Frank and Asuncion, 2010], and Mammography and Mulcross from [Rocke and Woodruff, 1996]. Table 2 illustrates the characteristics of these data sets.

Table 3: Anomaly detection AUC performance and test stage time of various methods.

AUC
Data Sets      rankAD   one-class svm   BP-K-NNG   aK-LPE   iForest   massAD
Annthyroid     0.844    0.681           0.823      0.753    0.856     0.789
Forest Cover   0.932    0.869           0.889      0.876    0.853     0.895
HTTP           0.999    0.998           0.995      0.999    0.986     0.995
Mammography    0.909    0.863           0.886      0.879    0.891     0.701
Mulcross       0.998    0.970           0.994      0.998    0.971     0.998
Satellite      0.885    0.774           0.872      0.884    0.812     0.692
Shuttle        0.996    0.975           0.985      0.995    0.992     0.992
SMTP           0.934    0.751           0.902      0.900    0.869     0.859

Test stage time
Data Sets      rankAD   one-class svm   BP-K-NNG   aK-LPE   iForest   massAD
Annthyroid     0.338    0.281           2.171      2.173    1.384     0.030
Forest Cover   1.748    1.638           8.185      13.41    7.239     0.483
HTTP           0.187    0.376           2.391      11.04    5.657     0.384
Mammography    0.237    0.223           0.981      1.443    1.721     0.044
Mulcross       2.732    2.272           8.772      13.75    7.864     0.559
Satellite      0.393    0.355           0.976      1.199    1.435     0.030
Shuttle        1.317    1.318           6.404      7.169    4.301     0.186
SMTP           1.116    1.105           7.912      11.76    5.924     0.411


We randomly sample 2000 nominal points for training. The rest of the nominal data and all of the anomalous data are held out for testing. Due to memory limits, at most 80000 nominal points are used at test time. The time for testing all test points and the AUC performance are reported in Table 3.

We observe that while being faster than BP-K-NNG, aK-LPE and iForest, and comparable to one-class SVM during test stage, our approach also achieves the best or tied-best AUC on all data sets except Annthyroid, where iForest is slightly better (Table 3). The density-based aK-LPE and BP-K-NNG have reasonably good performance, but their test time degrades with training set size. massAD is very fast at test stage, but has poor performance for several data sets.


One-class SVM Comparison: The baseline one-class SVM has good test time due to its similar $O(d s_{SV})$ test stage complexity, where $s_{SV}$ denotes the number of support vectors. However, its detection performance is relatively poor, because one-class SVM training in essence approximates one single $\alpha$-percentile density level set. Here $\alpha$ depends on the parameter of one-class SVM, which essentially controls the fraction of points violating the max-margin constraints [Schölkopf et al., 2001]. Decision regions obtained by thresholding with different offsets are simply scaled versions of that particular level set. Our rankAD approach significantly outperforms one-class SVM, because it has the ability to approximate different density level sets.


aK-LPE & BP-K-NNG Comparison: Computationally, 
RankAD significantly outperforms the density-based 
aK-LPE and BP-K-NNG, which is not surprising 
given our discussion in Sec. 4.3. Statistically, RankAD 
appears to be marginally better than aK-LPE and 
BP-K-NNG for many datasets, and this requires more 
careful reasoning. To evaluate the statistical significance 
of the reported test results, we note that the number 
of test samples ranges from 5000 to 500000, 
with at least 500 anomalous points. Consequently, we 
can bound test performance to within 2-5% error with 
95% confidence (< 2% for large datasets and < 5% for 
the smaller ones: Annthyroid, Mammography, Satellite) 
using a standard extension of known results for test-set 
prediction [Langford, 2005]. After accounting for this 
confidence, RankAD is marginally better than aK-LPE 
and BP-K-NNG statistically. For aK-LPE we use 
resampling to robustly compute ranked values (see Sec. 6.1), and 
for RankAD we use cross-validation (CV) (see Sec. 6.1) 
for rank prediction. Note that we cannot use CV for 
tuning predictors for detection because we do not have 
anomalous data during training. All of these arguments 
suggest that the regularization step in RankAD 
results in smoother level sets and better accounts for 
the smoothness of the true level sets (also see Fig. 6.2) in some 
cases, unlike NN methods. 


7 Conclusions 

We presented a novel anomaly detection framework 
based on combining statistical density information 
with a discriminative ranking procedure. Our scheme 
learns a ranker over all nominal samples based on 
the K-NN distances within the graph constructed from 
these nominal points. This is achieved through a pairwise 
learning-to-rank step, where each input preference 
pair $(x_i, x_j)$ asymptotically models the situation 
that data point $x_i$ is located in a higher density 
region relative to $x_j$. We then show the asymptotic 
consistency of our approach, which allows for flexible 
false alarm control during the test stage. We also provide 
a finite-sample generalization bound on the empirical 
false alarm rate of our approach. Experiments on synthetic 
and real data sets demonstrate that our approach has 
state-of-the-art statistical performance as well as low test 
time complexity. 
Appendix: Proofs of Theorems 

For ease of development, let $n = m_1(m_2 + 1)$, and divide the $n$ data points into $D = D_0 \cup D_1 \cup \dots \cup D_{m_1}$, where $D_0 = \{x_1, \dots, x_{m_1}\}$ and each $D_j$, $j = 1, \dots, m_1$, involves $m_2$ points. $D_j$ is used to generate the statistic $\eta$ for $u$ and $x_j \in D_0$, for $j = 1, \dots, m_1$. $D_0$ is used to compute the rank of $u$: 

$$R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{\eta(x_j; D_j) > \eta(u; D_j)\}} \qquad (7)$$

We provide the proof for the statistic $G(u)$ of the following form 



G{u-D^) 



( 8 ) 


where $D_{(i)}(u)$ denotes the distance from $u$ to its $i$-th nearest neighbor among the $m_2$ points in $D_j$. Practically we can omit 
the weight and use the average of the 1st to $l$-th nearest neighbor distances as shown in Sec. 3. 

Regularity conditions: $f(\cdot)$ is continuous and lower-bounded: $f(x) \ge f_{\min} > 0$. It is smooth, i.e. $\|\nabla f(x)\| \le \lambda$, where 
$\nabla f(x)$ is the gradient of $f(\cdot)$ at $x$. Flat regions are disallowed, i.e. $\forall x \in \mathcal{X}$, $\forall \sigma > 0$, $P\{y : |f(y) - f(x)| < \sigma\} \le M\sigma$, 
where $M$ is a constant. 


Proof of Theorem 1 

The proof involves two steps: 

1. The expectation of the empirical rank $E[R(u)]$ is shown to converge to $p(u)$ as $n \to \infty$. 

2. The empirical rank $R(u)$ is shown to concentrate at its expectation as $n \to \infty$. 

The first step is shown through Lemma 4. For the second step, notice that the rank $R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} Y_j$, where 
$Y_j = \mathbb{I}_{\{\eta(x_j; D_j) > \eta(u; D_j)\}}$ is independent across different $j$'s, and $Y_j \in [0, 1]$. By Hoeffding's inequality, we have: 

$$P(|R(u) - E[R(u)]| > \epsilon) < 2 \exp(-2 m_1 \epsilon^2) \qquad (9)$$

Combining these two steps finishes the proof. 
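
The concentration step can be checked numerically. The sketch below is an illustration only: it models the indicators $Y_j$ as independent Bernoulli variables with a hypothetical success probability `p`, estimates how often their average deviates from `p` by more than `eps`, and compares that frequency with the Hoeffding bound $2\exp(-2 m_1 \epsilon^2)$. All names and parameter values are assumptions chosen for the example.

```python
import math
import random


def hoeffding_bound(m1, eps):
    """Right-hand side of the Hoeffding bound for m1 variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m1 * eps * eps)


def empirical_deviation_rate(m1, p, eps, trials, seed=0):
    """Fraction of trials in which the mean of m1 Bernoulli(p) draws
    deviates from p by more than eps."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        mean = sum(1 for _ in range(m1) if rng.random() < p) / m1
        if abs(mean - p) > eps:
            exceed += 1
    return exceed / trials


m1, p, eps = 200, 0.3, 0.1
rate = empirical_deviation_rate(m1, p, eps, trials=2000)
bound = hoeffding_bound(m1, eps)
print(rate, bound)
```

The observed deviation frequency is typically far below the bound, which is expected: Hoeffding's inequality holds for any distribution on $[0, 1]$ and is therefore loose for a specific one.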

Lemma 4. By choosing $l$ properly, as $m_2 \to \infty$, it follows that 

$$|E[R(u)] - p(u)| \to 0.$$

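Proof. Take expectation with respect to $D$: 

$$E_D[R(u)] = E_{D \setminus D_0}\left[ E_{D_0}\left[ \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{\eta(u; D_j) < \eta(x_j; D_j)\}} \right] \right] \qquad (10)$$

$$= \frac{1}{m_1} \sum_{j=1}^{m_1} E_{x_j}\left[ E_{D_j}\left[ \mathbb{I}_{\{\eta(u; D_j) < \eta(x_j; D_j)\}} \right] \right] \qquad (11)$$

$$= E_x\left[ P_{D_1}\left( \eta(u; D_1) < \eta(x; D_1) \right) \right] \qquad (12)$$

The last equality holds due to the i.i.d. symmetry of $\{x_1, \dots, x_{m_1}\}$ and $D_1, \dots, D_{m_1}$. We fix both $u$ and $x$ and temporarily 
discard $E_{D_1}$. Let $F_x(y_1, \dots, y_{m_2}) = \eta(x) - \eta(u)$, where $y_1, \dots, y_{m_2}$ are the $m_2$ points in $D_1$. It follows: 

$$P_{D_1}(\eta(u) < \eta(x)) = P_{D_1}(F_x(y_1, \dots, y_{m_2}) > 0) = P_{D_1}(F_x - EF_x > -EF_x). \qquad (13)$$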

To check McDiarmid's requirements, we replace $y_j$ with $y_j'$. It is easily verified that $\forall j = 1, \dots, m_2$, 

$$|F_x(y_1, \dots, y_{m_2}) - F_x(y_1, \dots, y_j', \dots, y_{m_2})| \le 2^{1/d}\, \frac{2C}{l} \le \frac{4C}{l} \qquad (14)$$

where $C$ is the diameter of the support. Notice that despite the fact that $y_1, \dots, y_{m_2}$ are random vectors we can still apply 
McDiarmid's inequality, because according to the form of $\eta$, $F_x(y_1, \dots, y_{m_2})$ is a function of $m_2$ i.i.d. random variables 
$r_1, \dots, r_{m_2}$, where $r_i$ is the distance from $x$ to $y_i$. Therefore if $EF_x < 0$, or $E\eta(x) < E\eta(u)$, we have by McDiarmid's 
inequality, 

$$P_{D_1}(\eta(u) < \eta(x)) = P_{D_1}(F_x > 0) = P_{D_1}(F_x - EF_x > -EF_x) \le \exp\left( -\frac{(EF_x)^2 l^2}{8 C^2 m_2} \right) \qquad (15)$$

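Rewrite the above inequality as: 

$$\mathbb{I}_{\{EF_x > 0\}} - e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \le P_{D_1}(F_x > 0) \le \mathbb{I}_{\{EF_x > 0\}} + e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \qquad (16)$$

It can be shown that the same inequality holds for $EF_x > 0$, or $E\eta(x) > E\eta(u)$. Now we take expectation with respect to 
$x$: 

$$P_x(EF_x > 0) - E_x\left[ e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \right] \le E_x\left[ P_{D_1}(F_x > 0) \right] \le P_x(EF_x > 0) + E_x\left[ e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \right] \qquad (17)$$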

Divide the support of $x$ into two parts, $X_1$ and $X_2$, where $X_1$ contains those $x$ whose density $f(x)$ is relatively far away from 
$f(u)$, and $X_2$ contains those $x$ whose density is close to $f(u)$. We show that for $x \in X_1$, the above exponential term converges 
to 0 and $P(EF_x > 0) = P_x(f(u) > f(x))$, while the rest, $x \in X_2$, has very small measure. Let $A(x) = \left( \frac{l}{f(x) c_d m_2} \right)^{1/d}$. By 
Lemma 5 we have: 


|E 77 (a:) — A{x)\ < 7 


J_\^ 

m 2 


A{x) < 7 


I 


^7712 J \ fminCdm,2 

where 7 denotes the big O(-), and 71 = 7 ( ^—— j . Applying uniform bound we have: 


71 

Ijd 


m 2 . 


(18) 


d+1 


I 


< E [ 77 ( 0 ;) - 77 ( 77 )] < A{x) - A{u) + 2 I ^ 


i/d 


I 


m 2 


(19) 


Now let Xi = {r : |/(a:)-/( 77 )| > 871 (tIi) h- For a; G Xi, it can be verified that jA(a;)-^( 77 )] > 3 

or |E [77(a:) - 77(77)] I > (777)^- and !{/(„)>/(,)} = 1 I{e,,(,:)>e,,(^)}- For the exponential term in Equ.([T6l) we have: 


exp 


2C-m2)-^^A 8C2c|mr 


( 20 ) 


For X e X 2 = {x : |/(x) — f{u)\ < fr^}: t>y the regularity assumption, we have 'P(X 2 ) < 

SMyid fr^- Combining the two cases into Equ. Gzl) we have for upper bound: 

$$E_D[R(u)] = E_x\left[ P_{D_1}(\eta(u) < \eta(x)) \right] \qquad (21)$$

$$= \int_{X_1} P_{D_1}(\eta(u) < \eta(x)) f(x)\,dx + \int_{X_2} P_{D_1}(\eta(u) < \eta(x)) f(x)\,dx \qquad (22)$$


< E, (/(77)>/(a:)) + exp -- 


7 i« 


< E, (/(77) > f{x)) + exp - 


8C‘^cJm2 


P{x€Xi)+P{x€: 


+ 3M7rd/^t (i 


(23) 

(24) 


Let I = 777,2 such that 2 d+i < < !> and the latter two terms will converge to 0 as m 2 00 . Similar lines hold for the 

lower bound. The proof is finished. □ 


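Lemma 5. Let $A(x) = \left( \frac{l}{m c_d f(x)} \right)^{1/d}$ and $\lambda_1 = \frac{\lambda}{f_{\min}} \left( \frac{1.5}{c_d f_{\min}} \right)^{1/d}$. By choosing $l$ appropriately, the expectation of the $l$-NN 
distance $E D_{(l)}(x)$ among $m$ points satisfies: 

$$|E D_{(l)}(x) - A(x)| = O\left( A(x) \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right) \qquad (25)$$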
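Proof. Denote $r(x, \alpha) = \min\{r : P(B(x, r)) \ge \alpha\}$. Let $\delta_m \to 0$ as $m \to \infty$, and $0 < \delta_m < 1/2$. Let $U \sim \mathrm{Bin}(m, (1 + \delta_m)\frac{l}{m})$ 
be a binomial random variable, with $EU = (1 + \delta_m) l$. We have: 

$$P\left( D_{(l)}(x) > r\left(x, (1 + \delta_m)\tfrac{l}{m}\right) \right) = P(U < l) = P\left( U < \left( 1 - \frac{\delta_m}{1 + \delta_m} \right)(1 + \delta_m) l \right) \le \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)$$

The last inequality holds from Chernoff's bound. Abbreviate $r_1 = r(x, (1 + \delta_m)\frac{l}{m})$; then $E D_{(l)}(x)$ can be bounded as: 

$$E D_{(l)}(x) \le r_1 \left[ 1 - P(D_{(l)}(x) > r_1) \right] + C\, P(D_{(l)}(x) > r_1)$$

where $C$ is the diameter of the support. Similarly we can show the lower bound: 

$$E D_{(l)}(x) \ge r\left(x, (1 - \delta_m)\tfrac{l}{m}\right) - C \exp\left( -\frac{\delta_m^2 l}{2(1 - \delta_m)} \right)$$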


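Consider the upper bound. We relate $r_1$ with $A(x)$. Notice: 

$$P(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} \ge c_d r_1^d f_{\min},$$

so a fixed but loose upper bound is $r_1 \le \left( \frac{(1 + \delta_m) l}{c_d f_{\min} m} \right)^{1/d} = r_{\max}$. Assume $l/m$ is sufficiently small so that $r_1$ is sufficiently 
small. By the smoothness condition, the density within $B(x, r_1)$ is lower-bounded by $f(x) - \lambda r_1$, so we have: 

$$P(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} \ge c_d r_1^d \left( f(x) - \lambda r_1 \right) = c_d r_1^d f(x) \left( 1 - \frac{\lambda}{f(x)} r_1 \right) \ge c_d r_1^d f(x) \left( 1 - \frac{\lambda}{f_{\min}} r_{\max} \right)$$

That is: 

$$r_1 \le A(x) \left( \frac{1 + \delta_m}{1 - \frac{\lambda}{f_{\min}} r_{\max}} \right)^{1/d}$$

Insert the expression of $r_{\max}$ and set $\lambda_1 = \frac{\lambda}{f_{\min}} \left( \frac{1.5}{c_d f_{\min}} \right)^{1/d}$; we have: 

$$E D_{(l)}(x) - A(x) \le A(x) \left[ \left( \frac{1 + \delta_m}{1 - \lambda_1 (l/m)^{1/d}} \right)^{1/d} - 1 \right] + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)$$

$$\le A(x) \left[ \frac{1 + \delta_m}{1 - \lambda_1 (l/m)^{1/d}} - 1 \right] + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)$$

$$= A(x)\, \frac{\delta_m + \lambda_1 (l/m)^{1/d}}{1 - \lambda_1 (l/m)^{1/d}} + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)$$

$$= O\left( A(x) \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right) \qquad (26)$$

The last equality holds if we choose $l = m^{\frac{3d+8}{4d+8}}$ and $\delta_m = m^{-1/4}$. Similar lines follow for the lower bound. Combine these 
two parts and the proof is finished. 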

□ 
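
A quick numerical illustration of Lemma 5, under assumptions chosen only for this example: points are drawn uniformly on the unit square (so $d = 2$ and $f \equiv 1$), the query point is the center of the square, and the averaged $l$-NN distance is compared with $A(x) = (l / (m c_d f(x)))^{1/d}$. The helper names `lnn_distance` and `predicted_radius` are hypothetical.

```python
import math
import random


def lnn_distance(x, points, l):
    """Distance from x to its l-th nearest neighbor among points."""
    dists = sorted(math.dist(x, y) for y in points)
    return dists[l - 1]


def predicted_radius(l, m, density, d=2):
    """A(x) = (l / (m * c_d * f(x)))^(1/d), with c_d the unit-ball volume."""
    c_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (l / (m * c_d * density)) ** (1.0 / d)


rng = random.Random(0)
m, l, trials = 2000, 50, 20
x = (0.5, 0.5)  # center of the unit square, away from the boundary

avg = 0.0
for _ in range(trials):
    pts = [(rng.random(), rng.random()) for _ in range(m)]
    avg += lnn_distance(x, pts, l) / trials

a_x = predicted_radius(l, m, density=1.0)
ratio = avg / a_x
print(avg, a_x, ratio)
```

For a uniform density the gradient bound $\lambda$ is zero, so this only exercises the leading term $A(x)$; the correction term in (25) matters when the density varies over the neighborhood.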
Proof of Theorem 2 

We fix an RKHS $\mathcal{H}$ on the input space $X \subset \mathbb{R}^d$ with an RBF kernel $k$. Let $\mathbf{x} = \{x_1, \dots, x_n\}$ be a set of objects to be 
ranked in $\mathbb{R}^d$ with labels $\mathbf{r} = \{r_1, \dots, r_n\}$. Here $r_i$ denotes the label of $x_i$, and $r_i \in \mathbb{R}$. We assume $\mathbf{x}$ to be a random 
variable distributed according to $P$, and $\mathbf{r}$ deterministic. Throughout, $L$ denotes the hinge loss. 

The following notation will be useful in the proof of Theorem 2. Take $T$ to be the set of pairs derived from $\mathbf{x}$ and define 
the $L$-risk of $f \in \mathcal{H}$ as 

$$\mathcal{R}_{L,P}(f) := E_{\mathbf{x}}[\mathcal{R}_{L,T}(f)]$$

where 

$$\mathcal{R}_{L,T}(f) = \sum_{(x_i, x_j) \in T} D(r_i, r_j) L(f(x_i) - f(x_j))$$

and $D(r_i, r_j)$ is some positive weight function, which we take for simplicity to be $1/|\mathcal{P}|$. (This uniform weight is the 
setting we have taken in the main body of the paper.) The smallest possible $L$-risk in $\mathcal{H}$ is denoted 

$$\mathcal{R}_{L,P} := \inf_{f \in \mathcal{H}} \mathcal{R}_{L,P}(f).$$

The regularized $L$-risk is 

$$\mathcal{R}^{reg}_{L,P,\lambda}(f) := \lambda \|f\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f), \qquad (27)$$

$\lambda > 0$. 

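For simplicity we assume the preference pair set $\mathcal{P}$ contains all pairs over these $n$ samples. Let $g_{\mathbf{x},\lambda}$ be the optimal solution 
to the rank-AD minimization step. Setting $\lambda = 1/2C$ and replacing $C$ with $\lambda$ in the rank-SVM step, we have: 

$$g_{\mathbf{x},\lambda} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{L,T}(f) + \lambda \|f\|^2_{\mathcal{H}} \qquad (28)$$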
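
To make the objective in (28) concrete, here is a minimal sketch that evaluates the empirical pairwise hinge risk with uniform pair weights plus the regularizer. It is an illustration under stated assumptions, not the paper's implementation: a linear scorer $f(x) = \langle w, x \rangle$ stands in for the RKHS function (so $\|f\|^2$ becomes $\|w\|^2$), a pair $(x_i, x_j)$ is included whenever $r_i > r_j$, and the names `empirical_rank_risk` and `regularized_objective` are hypothetical.

```python
def hinge(z):
    """Hinge loss L(z) = max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def empirical_rank_risk(score, xs, labels):
    """Uniformly weighted hinge risk over all pairs (i, j) with
    labels[i] > labels[j], i.e. x_i should be scored above x_j."""
    n = len(xs)
    pairs = [(i, j) for i in range(n) for j in range(n) if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    total = sum(hinge(score(xs[i]) - score(xs[j])) for i, j in pairs)
    return total / len(pairs)


def regularized_objective(w, xs, labels, lam):
    """Empirical risk plus lam * ||w||^2 for a linear scorer (assumption)."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return empirical_rank_risk(score, xs, labels) + lam * sum(wi * wi for wi in w)


# Toy data (hypothetical): labels increase with the single feature.
xs = [(0.0,), (1.0,), (2.0,)]
labels = [0, 1, 2]
good = regularized_objective((1.0,), xs, labels, lam=0.1)
bad = regularized_objective((-1.0,), xs, labels, lam=0.1)
print(good, bad)
```

With `w = (1.0,)` every pair is separated by a margin of at least one, so only the regularizer contributes; with the sign flipped every pair incurs hinge loss.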


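Let $\mathcal{H}_n$ denote a ball of radius $O(1/\sqrt{\lambda_n})$ in $\mathcal{H}$. Let $C_k := \sup_{x,t} |k(x, t)|$ with $k$ the RBF kernel associated to $\mathcal{H}$. Given 
$\epsilon > 0$, we let $N(\mathcal{H}, \epsilon/4C_k)$ be the covering number of $\mathcal{H}$ by disks of radius $\epsilon/4C_k$. We first show that with appropriately 
chosen $\lambda$, as $n \to \infty$, $g_{\mathbf{x},\lambda}$ is consistent in the following sense. 

Lemma 6. Let $\lambda_n$ be appropriately chosen such that $\lambda_n \to 0$ and $\frac{\log N(\mathcal{H}, \epsilon/4C_k)}{n \lambda_n} \to 0$, as $n \to \infty$. Then we have 

$$E_{\mathbf{x}}[\mathcal{R}_{L,T}(g_{\mathbf{x},\lambda_n})] \to \mathcal{R}_{L,P} = \min_{f \in \mathcal{H}} \mathcal{R}_{L,P}(f), \quad n \to \infty.$$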

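Proof. Let us outline the argument. In [Steinwart, 2001], the author shows that there exists an $f_{P,\lambda} \in \mathcal{H}$ minimizing (27): 
Lemma 1. For all Borel probability measures $P$ on $X \times X$ and all $\lambda > 0$, there is an $f_{P,\lambda} \in \mathcal{H}$ with 

$$\mathcal{R}^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \inf_{f \in \mathcal{H}} \mathcal{R}^{reg}_{L,P,\lambda}(f)$$

such that $\|f_{P,\lambda}\| = O(1/\sqrt{\lambda})$. 

Next, a simple argument shows that 

$$\lim_{\lambda \to 0} \mathcal{R}^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \mathcal{R}_{L,P}.$$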

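Finally, we will need a concentration inequality to relate the $L$-risk of $f_{P,\lambda}$ with the empirical $L$-risk of $f_{T,\lambda}$. We then 
derive consistency using the following argument: 

$$\begin{aligned}
\mathcal{R}_{L,P}(f_{T,\lambda_n}) &\le \lambda_n \|f_{T,\lambda_n}\|^2 + \mathcal{R}_{L,P}(f_{T,\lambda_n}) \\
&\le \lambda_n \|f_{T,\lambda_n}\|^2 + \mathcal{R}_{L,T}(f_{T,\lambda_n}) + \delta/3 \\
&\le \lambda_n \|f_{P,\lambda_n}\|^2 + \mathcal{R}_{L,T}(f_{P,\lambda_n}) + \delta/3 \\
&\le \lambda_n \|f_{P,\lambda_n}\|^2 + \mathcal{R}_{L,P}(f_{P,\lambda_n}) + 2\delta/3 \\
&\le \mathcal{R}_{L,P} + \delta
\end{aligned}$$

where $\lambda_n$ is an appropriately chosen sequence $\to 0$, and $n$ is large enough. The second and fourth inequalities hold due to 
concentration inequalities, and the last one holds since $\lim_{\lambda \to 0} \mathcal{R}^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \mathcal{R}_{L,P}$. 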

We now prove the appropriate concentration inequality [Cucker and Smale, 2001]. Recall $\mathcal{H}$ is an RKHS with smooth 
kernel $k$; thus the inclusion $I_k : \mathcal{H} \to C(X)$ is compact, where $C(X)$ is given the $\|\cdot\|_\infty$-topology. That is, the "hypothesis 
space" $\mathcal{H} := \overline{I_k(B_R)}$ is compact in $C(X)$, where $B_R$ denotes the ball of radius $R$ in $\mathcal{H}$. We denote by $N(\mathcal{H}, \epsilon)$ the covering 
number of $\mathcal{H}$ with disks of radius $\epsilon$. We prove the following inequality: 

Lemma 2. For any probability distribution $P$ on $X \times X$, 

$$P^n\left\{ T \in (X \times X)^n : \sup_{f \in \mathcal{H}} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge \epsilon \right\} \le 2 N(\mathcal{H}, \epsilon/4C_k) \exp\left( -\frac{n \epsilon^2}{2(1 + 2\sqrt{C_k} R)^2} \right)$$

where $C_k := \sup_{x,t} |k(x, t)|$. 

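Proof. Since $\mathcal{H}$ is compact, it has a finite covering number. Now suppose $\mathcal{H} = D_1 \cup \dots \cup D_\ell$ is any finite covering of $\mathcal{H}$. 
Then 

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge \epsilon \right\} \le \sum_{j=1}^{\ell} \mathrm{Prob}\left\{ \sup_{f \in D_j} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge \epsilon \right\}$$

so we restrict attention to a disk $D$ in $\mathcal{H}$ of appropriate radius $\epsilon$. 

Suppose $\|f - g\|_\infty \le \epsilon$. We want to show that the difference 

$$|(\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)) - (\mathcal{R}_{L,T}(g) - \mathcal{R}_{L,P}(g))|$$

is also small. Rewrite this quantity as 

$$|(\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,T}(g)) - E_{\mathbf{x}}[\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,T}(g)]|.$$

Since $\|f - g\|_\infty \le \epsilon$, for $\epsilon$ small enough we have 

$$\max\{0, 1 - (f(x_i) - f(x_j))\} - \max\{0, 1 - (g(x_i) - g(x_j))\} = \max\{0, (g(x_i) - g(x_j) - f(x_i) + f(x_j))\}$$

$$= \max\{0, \langle g - f, \phi(x_i) - \phi(x_j) \rangle\}.$$

Here $\phi : X \to \mathcal{H}$ is the feature map, $\phi(x) := k(x, \cdot)$. Combining this with the Cauchy-Schwarz inequality, we have 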

$$|(\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,T}(g)) - E_{\mathbf{x}}[\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,T}(g)]| \le 4 C_k \epsilon,$$

where $C_k := \sup_{x,t} |k(x, t)|$. From this inequality it follows that 

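$$|\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge (4C_k + 1)\epsilon \implies |\mathcal{R}_{L,T}(g) - \mathcal{R}_{L,P}(g)| \ge \epsilon.$$

We thus choose to cover $\mathcal{H}$ with disks of radius $\epsilon/4C_k$, centered at $f_1, \dots, f_\ell$. Here $\ell = N(\mathcal{H}, \epsilon/4C_k)$ is the covering 
number for this particular radius. We then have 

$$\sup_{f \in D_j} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge 2\epsilon \implies |\mathcal{R}_{L,T}(f_j) - \mathcal{R}_{L,P}(f_j)| \ge \epsilon.$$

Therefore, 

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge 2\epsilon \right\} \le \sum_{j=1}^{\ell} \mathrm{Prob}\left\{ |\mathcal{R}_{L,T}(f_j) - \mathcal{R}_{L,P}(f_j)| \ge \epsilon \right\}$$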
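The probabilities on the RHS can be bounded using McDiarmid's inequality. 

Define the random variable $g(x_1, \dots, x_n) := \mathcal{R}_{L,T}(f)$, for fixed $f \in \mathcal{H}$. We need to verify that $g$ has bounded differences. 
If we change one of the variables, $x_i$, in $g$ to $x_i'$, then at most $n$ summands will change: 

$$|g(x_1, \dots, x_i, \dots, x_n) - g(x_1, \dots, x_i', \dots, x_n)| \le \frac{1}{n^2}\, 2n \sup_{x,y} |1 - (f(x) - f(y))|$$

$$\le \frac{2}{n} + \frac{2}{n} \sup_{x,y} |f(x) - f(y)|$$

$$\le \frac{2}{n} + \frac{4}{n} \sqrt{C_k}\, \|f\|.$$

Using that $\sup_{f \in \mathcal{H}} \|f\| \le R$, McDiarmid's inequality thus gives 

$$\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} |\mathcal{R}_{L,T}(f) - \mathcal{R}_{L,P}(f)| \ge \epsilon \right\} \le 2 N(\mathcal{H}, \epsilon/4C_k) \exp\left( -\frac{n \epsilon^2}{2(1 + 2\sqrt{C_k} R)^2} \right)$$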






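We are now ready to prove Theorem 2. Take $R = \|f_{P,\lambda}\|$ and apply this result to $f_{P,\lambda}$: 

$$\mathrm{Prob}\left\{ |\mathcal{R}_{L,T}(f_{P,\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda})| \ge \epsilon \right\} \le 2 N(\mathcal{H}, \epsilon/4C_k) \exp\left( -\frac{n \epsilon^2}{2(1 + 2\sqrt{C_k}\, \|f_{P,\lambda}\|)^2} \right)$$

Since $\|f_{P,\lambda_n}\| = O(1/\sqrt{\lambda_n})$, the RHS converges to 0 so long as $\frac{n \lambda_n}{\log N(\mathcal{H}, \epsilon/4C_k)} \to \infty$ as $n \to \infty$. This completes the 
proof of Theorem 2. 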


We now establish that under mild conditions on the surrogate loss function, the solution minimizing the expected surrogate 
loss will asymptotically recover the correct preference relationships given by the density $f$. 

Lemma 7. Let $L$ be a non-negative, non-increasing convex surrogate loss function that is differentiable at zero and 
satisfies $L'(0) < 0$. If 

$$g^* = \arg\min_{g \in \mathcal{H}} E_{\mathbf{x}}[\mathcal{R}_{L,T}(g)],$$

then $g^*$ will correctly rank the samples according to their density, i.e. $\forall x_i \ne x_j$, $f(x_i) > f(x_j) \implies g^*(x_i) > g^*(x_j)$. 
Assume the input preference pairs satisfy: $\mathcal{P} = \{(x_i, x_j) : f(x_i) > f(x_j)\}$, where $\mathbf{x} = \{x_1, \dots, x_n\}$ is drawn i.i.d. from 
distribution $f$. Let $\ell$ be some convex surrogate loss function that satisfies: (1) $\ell$ is non-negative and non-increasing; 
(2) $\ell$ is differentiable and $\ell'(0) < 0$. Then the optimal solution $g^*$ will correctly rank the samples according to $f$, i.e. 
$g^*(x_i) > g^*(x_j)$, $\forall x_i \ne x_j$, $f(x_i) > f(x_j)$. 

The hinge loss satisfies the conditions in the above lemma. Combining Lemmas 6 and 7, we establish that asymptotically, 
the rank-SVM step yields a ranker that preserves the preference relationship on nominal samples given by the nominal 
density $f$. 

Proof. Our proof follows similar lines to Theorem 4 in [Lan et al., 2012]. Assume that $g(x_i) < g(x_j)$, and define a function 
$g'$ such that $g'(x_i) = g(x_j)$, $g'(x_j) = g(x_i)$, and $g'(x_k) = g(x_k)$ for all $k \ne i, j$. We have $\mathcal{R}_{L,P}(g') - \mathcal{R}_{L,P}(g) = E_{\mathbf{x}}(\Delta(\mathbf{x}))$, 
where 

$$\begin{aligned}
\Delta(\mathbf{x}) ={}& \sum_{k: r_j < r_i < r_k} [D(r_k, r_j) - D(r_k, r_i)][L(g(x_k) - g(x_i)) - L(g(x_k) - g(x_j))] \\
&+ \sum_{k: r_j < r_k < r_i} D(r_i, r_k)[L(g(x_j) - g(x_k)) - L(g(x_i) - g(x_k))] \\
&+ \sum_{k: r_j < r_k < r_i} D(r_k, r_j)[L(g(x_k) - g(x_i)) - L(g(x_k) - g(x_j))] \\
&+ \sum_{k: r_j < r_i < r_k} [D(r_k, r_j) - D(r_k, r_i)][L(g(x_k) - g(x_i)) - L(g(x_k) - g(x_j))] \\
&+ \sum_{k: r_j < r_i < r_k} [D(r_i, r_k) - D(r_j, r_k)][L(g(x_j) - g(x_k)) - L(g(x_i) - g(x_k))] \\
&+ (L(g(x_j) - g(x_i)) - L(g(x_i) - g(x_j))) D(r_i, r_j).
\end{aligned}$$

Using the requirements of the weight function $D$ and the assumption that $L$ is non-increasing and non-negative, we see 
that all six sums in the above equation for $\Delta(\mathbf{x})$ are negative. Thus $\Delta(\mathbf{x}) < 0$, so $\mathcal{R}_{L,P}(g') - \mathcal{R}_{L,P}(g) = E_{\mathbf{x}}(\Delta(\mathbf{x})) < 0$, 
contradicting the minimality of $g$. Therefore $g(x_i) \ge g(x_j)$. 


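Now we assume that $g(x_i) = g(x_j) = g_0$. Since $\mathcal{R}_{L,P}(g) = \inf_{h \in \mathcal{H}} \mathcal{R}_{L,P}(h)$, we have 

$$\left. \frac{\partial \mathcal{R}_L(g; \mathbf{x})}{\partial g(x_i)} \right|_{g_0} = A = 0, \quad \text{and} \quad \left. \frac{\partial \mathcal{R}_L(g; \mathbf{x})}{\partial g(x_j)} \right|_{g_0} = B = 0, \quad \text{where}$$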


A= ^ II(rk,ri)l-L'(g(xk) - go)] + ^ D(ri,rk)L'(go - g(xk))+ 

k:rj<ri<rk k:rj<rk<ri 

^ D(ri,rk)L'(go - g(xk)) + D(ri,rj)[-L'((>)]. 

k\r],,<rj<ri 


B= ^ D(rk,rj)[-L'(g(xk) - go)]+ ^ D(rk,rj)L'{go-g(xk)) + 

fc:rj<ri<rs, k:rj<rf,<ri 

D{rj,rk)L'{go - g{xk)) + D{ri,rj)[-L'(0)]. 

k\rk<rj<ri 


50 
However, using $L'(0) < 0$ and the requirements of $D$ we have 

$$A - B \le 2 L'(0) D(r_i, r_j) < 0,$$

contradicting $A = B = 0$. 


The following lemma completes the proof of Theorem 2: 

Lemma 8. Assume $G$ is any function that gives the same order relationship as the density: $G(x_i) > G(x_j)$, $\forall x_i \ne x_j$ 
such that $f(x_i) > f(x_j)$. Then 

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{G(x_i) < G(\eta)\}} \to p(\eta). \qquad (29)$$
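
Lemma 8 can be illustrated with a small simulation. The sketch below is only an example under assumptions that are not part of the paper: the nominal density is a one-dimensional standard Gaussian, the scoring function is $G(x) = -|x|$ (which orders points exactly as the Gaussian density does), and $p(\eta)$ is taken to be the probability mass of points whose density does not exceed $f(\eta)$, which for this density equals $\mathrm{erfc}(|\eta| / \sqrt{2})$. The function names are hypothetical.

```python
import math
import random


def empirical_rank(G, samples, eta):
    """Fraction of nominal samples scored strictly below the test point."""
    g_eta = G(eta)
    return sum(1 for x in samples if G(x) < g_eta) / len(samples)


def gaussian_p_value(eta):
    """P(f(x) <= f(eta)) = P(|x| >= |eta|) for a standard Gaussian f."""
    return math.erfc(abs(eta) / math.sqrt(2.0))


def density_order_score(x):
    """Any score monotone in the density works; here G(x) = -|x|."""
    return -abs(x)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
eta = 1.5
r = empirical_rank(density_order_score, samples, eta)
p = gaussian_p_value(eta)
print(r, p)
```

Declaring $\eta$ anomalous whenever the empirical rank falls at or below a chosen level $\alpha$ then gives a test whose false alarm rate approaches $\alpha$ as the number of nominal samples grows.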

Proof of Theorem 3 


To prove Theorem 3 we need the following lemma [Vapnik, 1979]: 

Lemma 3. Let $\mathcal{X}$ be a set and $S$ a system of sets in $\mathcal{X}$, and $P$ a probability measure on $S$. For $X \in \mathcal{X}^n$ and $A \in S$, define 
$\nu_X(A) := |X \cap A|/n$. If $n > 2/\epsilon$, then 

$$P^n\left\{ X : \sup_{A \in S} |\nu_X(A) - P(A)| > \epsilon \right\} \le 2 P^{2n}\left\{ XX' : \sup_{A \in S} |\nu_X(A) - \nu_{X'}(A)| > \epsilon/2 \right\}.$$


Proof Consider the event 


J;=<^XGT":3/GP,P{a;:/(2:)</ 


(m) 


27} > 


m — 1 
n 



We must show that $P^n(J) \le \delta$ for $\epsilon = \epsilon(n, k, \delta)$. Fix $k$ and apply Lemma 3 with 

A = {x : f{x) < - 27 } 

with 7 small enough so that 


Vyi[A) = [{xj G X : f{xj) < - 27 }I/n = 


m — 1 


We obtain 


P"(J) < 2P^"|XX' : 3/ G P, \{x/ G X' : /(xj) < - 27 }] > en/2 

The remaining portion of the proof follows as Theorem 12 in [Schölkopf et al., 2001]. 